Accession number & update

0007388169 20051201. 

Title 

Architectural extensions to support efficient communication using message prediction. 
Conference information 

Proceedings 16th Annual International Symposium on High Performance Computing Systems and 
Applications, Moncton, NB, Canada, 16-19 June 2002. 
Source 

Proceedings 16th Annual International Symposium on High Performance Computing Systems and 

Applications, 2002, p. 20-7, 25 refs, pp. xi+314, ISBN: 0-7695-1626-2.

Publisher: IEEE Comput. Soc, Los Alamitos, CA, USA. 
Author(s) 

Afsahi-A, Dimopoulos-N-J.

Editor(s): Almhana-J-N, Bhavsar-V.
Author affiliation 

Afsahi, A., Dept. of Electr. & Comput. Eng., Queen's Univ., Kingston, Ont., Canada. 
Abstract 

With increasing uniprocessor and SMP computation power, workstation clusters are becoming viable
alternatives to high performance computing systems. Communication overhead affects the 
performance of parallel computers significantly. A significant portion of the software communication 
overhead is attributable to message copying. We argue that it is possible to address the message 
copying problem at the receiving side through speculation. We show that messages display a form of 
locality, and we introduce the notion of message prediction for the receiving side of message- 
passing systems. By predicting a receive communication call before it is posted, we are able to place 
the required message directly into the cache speculatively before it is needed so that effectively a 
zero-copy communication can be achieved. Specific extensions to the ISA and the processor 
architecture accommodate late binding without requiring copying of the message. 
Descriptors 
PROCESSING.
Keywords

SMP; workstation-clusters; high-performance-computing-systems; message-copying; parallel-computers;
message-prediction; message-passing; zero-copy-communication.
Treatment codes 

P Practical.
Language 

English. 
Publication type 

Conference-proceedings.
Availability 

CCCC: 0-7695-1626-2/02/$17.00. 
Publication year 

2002. 
Publication date 

20020000. 
Edition 

2002037. 
Copyright statement 

Copyright 2002 IEE.
Title

Pin-down cache: a virtual memory management technique for zero-copy communication. 
Conference information 

Proceedings of the First Merged International Parallel Processing Symposium and Symposium on 

Parallel and Distributed Processing, Orlando, FL, USA, 30 March-3 April 1998. 

Sponsor(s): IEEE Comput. Soc. Tech. Committee on Parallel Process; ACM SIGARCH. 
Source 

Proceedings of the First Merged International Parallel Processing Symposium and Symposium on 
Parallel and Distributed Processing (Cat. No.98TB100227), 1998, p. 308-14, 9 refs, pp. xxv+809,
ISBN: 0-8186-8404-6. 

Publisher: IEEE Comput. Soc, Los Alamitos, CA, USA.
Author(s) 

Tezuka-H, O-Carroll-F, Hori-A, Ishikawa-Y.
Abstract 

The overhead of copying data through the central processor by a message passing protocol limits data 
transfer bandwidth. If the network interface directly transfers the user's memory to the network by 
issuing DMA, such data copies may be eliminated. Since the DMA facility accesses the physical memory 
address space, user virtual memory must be pinned down to a physical memory location before the 
message is sent or received. If each message transfer involves pin-down and release kernel
primitives, message transfer bandwidth will decrease since those primitives are quite expensive. The 
authors propose a zero copy message transfer with a pin-down cache technique which reuses the 
pinned-down area to decrease the number of calls to pin-down and release primitives. The proposed 
facility has been implemented in the PM low-level communication library on the RWC PC Cluster II,
consisting of 64 Pentium Pro 200 MHz CPUs connected by a Myricom Myrinet network, and running
NetBSD. The PM achieves 108.8 MBytes/sec for a 100% pin-down cache hit ratio and 78.7 MBytes/sec
when every pin-down cache access misses. The MPI library has been implemented on top of PM. According
to the NAS parallel benchmark results, an application still achieves better performance even when the
cache miss ratio is very high.
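
Illustrative note: the reuse idea described in this abstract can be sketched as a small simulation. This is a minimal toy model, not the PM library's implementation. The class name, the per-page granularity, the LRU eviction policy, and the counters are all assumptions made for illustration; the abstract states only that pinned-down areas are reused to reduce calls to the pin-down and release primitives.

```python
from collections import OrderedDict


class PinDownCache:
    """Toy pin-down cache (hypothetical): pages stay pinned after a transfer
    and are released only when the pinned-page budget would be exceeded."""

    def __init__(self, max_pinned_pages):
        self.max_pinned_pages = max_pinned_pages
        self.pinned = OrderedDict()  # page number -> None, oldest use first
        self.pin_calls = 0           # simulated calls to the pin-down primitive
        self.release_calls = 0       # simulated calls to the release primitive

    def transfer(self, first_page, num_pages):
        """Make sure every page of the buffer is pinned before a simulated DMA."""
        if num_pages > self.max_pinned_pages:
            raise ValueError("buffer is larger than the pinned-page budget")
        pages = range(first_page, first_page + num_pages)
        for page in pages:
            if page in self.pinned:
                # Hit: the page is already pinned, so no kernel call is needed.
                self.pinned.move_to_end(page)
                continue
            # Miss: release least recently used pages that are not part of
            # this transfer until there is room, then pin the new page.
            while len(self.pinned) >= self.max_pinned_pages:
                victim = next(p for p in self.pinned if p not in pages)
                del self.pinned[victim]
                self.release_calls += 1
            self.pinned[page] = None
            self.pin_calls += 1


# Sending the same two-page buffer twice pins its pages only once.
cache = PinDownCache(max_pinned_pages=4)
cache.transfer(first_page=0, num_pages=2)
cache.transfer(first_page=0, num_pages=2)
```

In this model a transfer whose pages are all still pinned costs no pin-down or release calls, which corresponds to the 100% hit case quoted in the abstract.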
Publication year
1998. 
Accession number & update

0007503575 20051201. 

Title 

The use of prediction for accelerating upgrade misses in cc-NUMA multiprocessors. 
Conference information 

Proceedings 2002 International Conference on Parallel Architectures and Compilation Techniques. PACT 
2002, Charlottesville, VA, USA, 22-25 Sept. 2002. 

Sponsor(s): IEEE Comput. Soc. Tech. Committee on Comput Archit; IEEE Comput. Soc. Tech. 
Committee on Parallel Process; ACM Special Interest Group on Comput. Architecture; IFIP WG 10.3; 
IBM; Microsoft Res; Univ. Virginia Dept. Comput. Sci.
Source 

Proceedings 2002 International Conference on Parallel Architectures and Compilation Techniques. PACT 
2002, 2002, p. 155-64, 28 refs, pp. xviii+305, ISBN: 0-7695-1620-3. 
Publisher: IEEE Comput. Soc, Los Alamitos, CA, USA. 
Author(s) 

Acacio-M-E, Gonzalez-J, Garcia-J-M, Duato-J.
Author affiliation 

Acacio, M.E., Gonzalez, J., Garcia, J.M., Duato, J., Univ. de Murcia, Spain. 
Abstract 

This work is focused on accelerating upgrade misses in cc-NUMA multiprocessors. These misses are
caused by store instructions for which a read-only copy of the line is found in the L2 cache. Upgrade 
misses require a message sent from the missing node to the directory, a directory lookup in order to 
find the set of sharers, invalidation messages being sent to the sharers and responses to the 
invalidations being sent back. Therefore, the penalty paid by these misses is not negligible, mainly if 
we consider that they account for a high percentage of the total miss rate. We propose the use of 
prediction as a means of providing cc-NUMA multiprocessors with a more efficient support for upgrade 
misses by directly invalidating sharers from the missing node. Our proposal comprises an effective 
prediction scheme achieving high hit rates as well as a coherence protocol extended to support the use 
of prediction. Our work is motivated by two key observations: first, upgrade misses present a repetitive 
behavior and, second, the total number of sharers being invalidated is small (one, in some cases). 
Using execution-driven simulations, we show that the use of prediction can significantly accelerate
upgrade misses (latency reductions of more than 40% in some cases). These important improvements
translate into speed-ups in application performance of up to 14%. Finally, these results can be obtained
by including a predictor with a total size of less than 48 KB in every node.
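
Illustrative note: the two observations in this abstract (upgrade misses repeat, and few sharers are invalidated) suggest that remembering the last set of sharers for each line could work as a predictor. The sketch below is a hypothetical last-value predictor written only to illustrate that idea. The class name, the unbounded table, and the exact-match accuracy measure are assumptions; the abstract does not describe the organisation of the predictor the authors propose.

```python
class LastSharersPredictor:
    """Hypothetical predictor: for each cache line, remember the set of
    sharers that had to be invalidated on the previous upgrade miss."""

    def __init__(self):
        self.table = {}        # line address -> frozenset of sharer node ids
        self.predictions = 0   # upgrade misses for which a prediction existed
        self.correct = 0       # predictions that matched the actual sharers

    def predict(self, line_addr):
        """Return the predicted sharer set, or None if the line is unknown."""
        return self.table.get(line_addr)

    def update(self, line_addr, actual_sharers):
        """Score the previous prediction against the sharers reported by the
        directory, then store the new set for the next upgrade miss."""
        actual = frozenset(actual_sharers)
        predicted = self.table.get(line_addr)
        if predicted is not None:
            self.predictions += 1
            if predicted == actual:
                self.correct += 1
        self.table[line_addr] = actual


# A line that is repeatedly shared with node 3 becomes predictable
# after its first upgrade miss.
predictor = LastSharersPredictor()
predictor.update(0x40, {3})
predictor.update(0x40, {3})
```

With a correct prediction the missing node could send invalidations to the predicted sharers directly instead of waiting for the directory lookup, which is the latency saving the abstract refers to.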
Descriptors 

CACHE-STORAGE; DELAYS; MEMORY-PROTOCOLS; SHARED-MEMORY-SYSTEMS.
Classification codes 

C5640 Protocols *;

C5220P Parallel-architecture;

C6120 File-organisation.
Keywords 

prediction; upgrade-miss-acceleration; cc-NUMA-multiprocessors; L2-cache; direct-invalidation;

coherence-protocol; repetitive-behavior; sharers; execution-driven-simulations; latency-reductions. 
Treatment codes 

P Practical.
Language 

English. 
Publication type 

Conference-proceedings. 
Availability 

CCCC: 1089-795X/02/$17.00. 
Digital object identifier 

10.1109/PACT.2002.1106014.
Publication year 

2002. 
Publication date 

20020000. 
Edition 

2003002. 
Copyright statement 

Copyright 2003 IEE.
Title

Architectural extensions to support efficient communication using message prediction. 
Conference information 

Proceedings 16th Annual International Symposium on High Performance Computing Systems and 
Applications, Moncton, NB, Canada, 16-19 June 2002. 
Source 

Proceedings 16th Annual International Symposium on High Performance Computing Systems and 

Applications, 2002, p. 20-7, 25 refs, pp. xi+314, ISBN: 0-7695-1626-2.

Publisher: IEEE Comput. Soc, Los Alamitos, CA, USA. 
Author(s) 

Afsahi-A, Dimopoulos-N-J.

Editor(s): Almhana-J-N, Bhavsar-V . 
Author affiliation 

Afsahi, A., Dept. of Electr. & Comput. Eng., Queen's Univ., Kingston, Ont., Canada. 
Abstract 

With increasing uniprocessor and SMP computation power, workstation clusters are becoming viable
alternatives to high performance computing systems. Communication overhead affects the
performance of parallel computers significantly. A significant portion of the software communication
overhead is attributable to message copying. We argue that it is possible to address the message 
copying problem at the receiving side through speculation. We show that messages display a form of 
locality, and we introduce the notion of message prediction for the receiving side of message- 
passing systems. By predicting a receive communication call before it is posted, we are able to place 
the required message directly into the cache speculatively before it is needed so that effectively a 
zero-copy communication can be achieved. Specific extensions to the ISA and the processor 
architecture accommodate late binding without requiring copying of the message. 
Descriptors 
PROCESSING.
Classification codes
C6150N Distributed-systems-software;

C5440 Multiprocessing-systems.
Keywords 

SMP; workstation-clusters; high-performance-computing-systems; message-copying; parallel-computers;
message-prediction; message-passing; zero-copy-communication.
Treatment codes 

P Practical.
Language 

English. 
Publication type 

Conference-proceedings.
Availability 

CCCC: 0-7695-1626-2/02/$17.00. 
Publication year 

2002. 
Publication date 

20020000. 
Edition 

2002037. 
Copyright statement 

Copyright 2002 IEE.
Title

Pin-down cache: a virtual memory management technique for zero-copy communication. 
Conference information 

Proceedings of the First Merged International Parallel Processing Symposium and Symposium on 
Parallel and Distributed Processing, Orlando, FL, USA, 30 March-3 April 1998. 
Sponsor(s): IEEE Comput. Soc. Tech. Committee on Parallel Process; ACM SIGARCH. 
Source 

Proceedings of the First Merged International Parallel Processing Symposium and Symposium on 
Parallel and Distributed Processing (Cat. No.98TB100227), 1998, p. 308-14, 9 refs, pp. xxv+809, 
ISBN: 0-8186-8404-6. 

Publisher: IEEE Comput. Soc, Los Alamitos, CA, USA. 
Author(s) 

Tezuka-H, O-Carroll-F, Hori-A, Ishikawa-Y.
Abstract
The overhead of copying data through the central processor by a message passing protocol limits data
transfer bandwidth. If the network interface directly transfers the user's memory to the network by 
issuing DMA, such data copies may be eliminated. Since the DMA facility accesses the physical memory 
address space, user virtual memory must be pinned down to a physical memory location before the 
message is sent or received. If each message transfer involves pin-down and release kernel
primitives, message transfer bandwidth will decrease since those primitives are quite expensive. The 
authors propose a zero copy message transfer with a pin-down cache technique which reuses the 
pinned-down area to decrease the number of calls to pin-down and release primitives. The proposed 
facility has been implemented in the PM low-level communication library on the RWC PC Cluster II, 
consisting of 64 Pentium Pro 200 MHz CPUs connected by a Myricom Myrinet network, and running 
NetBSD. The PM achieves 108.8 MBytes/sec for a 100% pin-down cache hit ratio and 78.7 MBytes/sec 
when every pin-down cache access misses. The MPI library has been implemented on top of PM. According
to the NAS parallel benchmark results, an application still achieves better performance even when the
cache miss ratio is very high.
Descriptors 
interface; user-memory-transfer; network; physical-memory-address-space; user-virtual-memory;
physical-memory-location; release-kernel-primitives; pin-down-primitives;
message-transfer-bandwidth; zero-copy-message-transfer; pinned-down-area-reuse;
PM-low-level-communication-library; RWC-PC-Cluster-II; Pentium-Pro-CPUs; Myricom-Myrinet-network;
NetBSD; pin-down-cache-hit-ratio; pin-down-cache-miss; MPI-library; NAS-parallel-benchmarks;
78.7-MByte/s; 108.8-
Title

Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors. 
Conference information 

Proceedings 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, 
Italy, 22-24 June 1995. 

Sponsor(s): ACM SIGARCH; IEEE Comput. Soc. TCCA; Univ. Genoa.
Source 

Proceedings 22nd Annual International Symposium on Computer Architecture (IEEE Cat.

No.95CB35801), 1995, p. 48-59, 40 refs, pp. xiii+426, ISBN: 0-89791-698-0.

Publisher: ACM, New York, NY, USA. 
Author(s) 

Lebeck-A-R, Wood-D-A.
Author affiliation 

Lebeck, A.R., Wood, D.A., Dept. of Comput. Sci., Wisconsin Univ., Madison, WI, USA. 
Abstract 

The paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence
overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a
processor automatically invalidate its local copy of a cache block before a conflicting access by
another processor. Eliminating invalidation overhead is particularly important under sequential
consistency, where the latency of invalidating outstanding copies can increase a program's critical
path. DSI is applicable to software, hardware, and hybrid coherence schemes. We evaluate DSI in the
context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI
reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%.
This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to
allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak
consistency, DSI can exploit tear-off blocks, which eliminate both invalidation and acknowledgment
messages, for a total reduction in messages of up to 26%.
Descriptors 
Abstract

The Kyushu University reconfigurable parallel processor system is an MIMD-type multiprocessor which 
consists of 128 processing elements (PEs), interconnected by a full (128x128) crossbar network. The 
system employs reconfigurable memory architectures, a kind of local/remote memory architectures, 
and encompasses a shared-memory TCMP (tightly coupled multiprocessor) and a message-passing 
LCMP (loosely coupled multiprocessor). When the system is configured as a shared-memory TCMP,
memory contentions will be obstacles to the performance. To relieve the effects, the system provides
each PE with a private unified cache. Each PE may have the cached copy of shared data in its cache
whether it accesses local or remote memory, and therefore the multicache consistency, or
inter-cache coherence, problem arises. The cache is a virtual-address direct-mapped cache to meet the
requirements for the hit time and size. The virtual-address cache implementation causes the other
consistency problem, the 'synonym' problem, called the intra-cache coherence problem. The paper
presents the cache coherence schemes that have been chosen for resolving these cache coherence 
problems. 